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Abstract 

We present a parallel computation scheme based on the Arnoldi algorithm for exact diag- 
onalization of quantum-electron models. It contains a selective data transferring method 
and distributed storage format for efficient computing of the matrix-vector product on 
distributed computing systems. The performed numerical experiments demonstrated 
good performance and scalability of our eigenvalue solver. The developed technique has 
been applied to investigate the electronic properties of Sr2Ru04 at experimental tem- 
peratures. The role of the spin-flip term in the electronic Hamiltonian was analyzed. 
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1. Introduction 

Dynamical mean-field theory (DMFT) has been successfully used for studying the 
electronic properties of a variety of strongly correlated materials in which the strength of 
the electron-electron interaction energy is comparable to or larger than the kinetic one [l| . 
Since these characteristics lead to a wealth of physical phenomena such as metal insulator 
transitions, exotic magnetic structures, and unconventional superconducting phases Q, 
the DMFT method attracts a great deal of attention. The main underlying idea of this 
approach is the mapping of the lattice problem onto an effective impurity problem which 
in turn is solved numerically exactly. For that one can use different numerical techniques 
such as exact diagonalization (ED), numerical renormalization group (NRG), quantum 
Monte Carlo method with the Hirsch-Fye algorithm (QMC-HF), or other schemes ij. 

Many interesting and promising results were obtained by using QMC methods. For 
instance, a realistic five-band modeling of the strongly correlated materials is mainly 
performed by using QMC Hirsch-Fye method. However, there is a number of technical 
limitations, (i) The simulation temperature of 250 K - 1000 K is much higher than 
experimental one of 10 K - 100 K. (ii) The Coulomb interaction matrix contains density- 
density terms only, (iii) The sign problem prevents scientists from calculating the physical 
properties at sufficiently low temperatures (iv) The QMC approaches require a 
maximum entropy method for analytical continuation of the resulting Green's function 
to real-energy axis. 
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Such problems can be solved by using the ED techniques. The full diagonalization 
method is limited to three-band systems because of the extremely rapid increase of the 
Hilbert space with the number of the electronic states Ng. For instance, full diagonal- 
ization of the Hamiltonian with two impurity orbitals, each coupled to three bath levels 
(Ns = 8), requires diagonalization of matrices with dimension up to 4900 which roughly 
represents the time and storage limit of what is computationally meaningful. 

The diagonalization problem can be simplified by exploiting the extreme sparseness 
of the Hamiltonian matrices and focusing on the limited number of the states with the 
lowest energies. It can be done by using Lanczos 4] or Arnoldi Q approaches. As a 
result the Hamiltonian corresponding to Ns — 12 can be treated at about the same 
computational cost as the full diagonalization for Ns = 8. This improvement allows the 
application of ED/DMFT to realistic three-band and five-band systems However, 
there is memory limitation of the Hamiltonain matrix storage to apply ED/DMFT to 
more complicate physical systems. To date the biggest matrix that can be diagonalized 
by this approach on a modern workstation is about 6435^ x 6435^ which corresponds to 



In this paper we present a sparse matrix- vector product (SpMV) parallelization strat- 
egy for effective solving ED/DMFT equations on distributed computing systems. For 
these purposes we suggest two improvements of the original ED/DMFT method [ij: a 
distributed storage format for sparse large matrix and a selective data transferring. The 
former allows the application of the ED/DMFT method for system with the number 
of electronic states more than Ng ~ 15. The selective data transferring method sub- 
stantially reduces amount of data transmitted between the processors of a distributed 
computing system, which improves SpMV performance. 

The paper organized as follows. In section 2, we presents the basic steps of the 
DMFT theory. In section 3, the developed eigenvalue solver and its parallel performance 
are given. We also give approbation of our computational scheme for a one-band Hubbard 
model on square lattice. For purpose of illustration in section 4 we will use a three-band 
model for Sr2Ru04 compound. 

2. Theory 

One way to describe many-particle phenomena such as high-temperature supercon- 
ductivity, heavy fermions and others is to solve the Hubbard Hamiltonian Q 



where (Qo-) is creation (annihilation) operators of fermion, tij is a hopping integral, 
U is the on-site local Coulomb interaction, is the particle number operator. It is a 
very complicated problem due to the exponential growth of the Hamiltonian and, hence, 
the direct solution of the Hubbard Hamiltonian is only possible for small model clusters 



In case of the real systems one can use the Dynamical Mean- Field Theory [1] in which 
the lattice problem is reduced to a single-site effective problem. Thus, such a mean-field 
consideration leads to freezing spatial correlations. Instead of the Hubbard Hamiltonian 
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we solve the Anderson impurity model which is given by 



HaIM = ^ Vkda [d'^Cka- + C^^da-] + ^ EfccrCfc^Cfco- + + ^l) + U n^Ui, (2) 

where = A'^s — 1 is a number of the bath states, d^{dc^) and c^^ {cka) are creation 
(annihilation) operators for fermions with spin a associated with the impurity site and 
with the state k of the effective bath, respectively, and Ed are the energy of bath and 
impurity, Vkd is a hybridisation integral of impurity and bath states. The main task 
of the ED/DMFT is to diagonalize the impurity Hamiltonian Haim to compute the 
impurity Green function which can be written as 

G = i E., I^^SS [e-^^^ + e-^-^] , (3) 

where | v) (| /i)) and (Ef^) are eigenvectors and eigenvalues of Haim and Z — 
exp(— /3£^^) is the partition function. The exponential factor in ([3]) indicates that at 
large enough (3 only a small number of the lowest eigenstates needs to be calculated [1]. 
In this paper, we develop a high performance parallel technique for solving the eigen- 
value problem of the Anderson impurity model Hamiltonian matrix ([2]) on distributed 
computing systems. 



3. Distributed eigenvalue solver 

A few iterative numerical algorithms such as a power method, a Lanczos method, an 
implicitly restarted Arnoldi method 0], the conjugate gradient (CG) method have 
been proposed to solve the eigenvalue problem for large sparse matrices. Most of them 
are based on iterative multiplication of the Hamiltonian matrix H by the trial vector v. 
Thus they are ideally suited for parallelization. 

An efficient implementation of the Lanczos method for symmetric multiprocessing 



computers with large shared memory was proposed in Ref. ll[ for solving the Heisen- 
berg model. The authors have demonstrated a perfect scalability of their computational 
scheme. It was possible since they used a SMP machine with eight cores which were 
sitting on the same board. As we will show below it is not the case for a distributed 
architecture. 

Recently, the ED results for a five-band DMFT problem were reported by Liebsch 
0. Judging by a short description of the technical details they have also used a SMP 
machine with large shared memory and 32 processors on the board. The aim of our work 
is to modify the exact diagonalization method for solving DMFT equations on distributed 
computing systems. 

3.1. Distributed CSR format and selective data transferring 

Since the Hamiltonian matrix H is extremely sparse the performance of the H x v 
operation strongly depends on the storage format and SpMV algorithm we use. The 
memory subsystem and, more specifically, the memory bandwidth is identified as the 
main performance bottleneck of the SpMV routine [l^, ■ This performance bottleneck 
is due to the fact that SpMV performs 0{nnz) operations where nnz is a number of the 
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nonzero elements of the sparse matrix, which means that most of the data are accessed in 
a streaming manner and there is httle temporal locality. To avoid indirect memory access 
some sparse matrix storage format is used. One of the most widely used storage format 
for sparse matrices is Compressed Sparse Row (CSR) format IJ, ll5| which stores all the 
non-zero elements of the Hamiltonian in contiguous memory locations (Hyaiues array) 
and uses two additional arrays for indexing information: rowJ,nd contains the start of 
each row within the non-zero elements array and coljind contains the column number 
associated with each non-zero element. The size of the Hyaiues and coljind arrays are 
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Figure 1: Example of CSR storage format. The original matrix H is compressed into three arrays 
rowjind, colSnd and H^alue- 



equal to the number of the non-zero elements (nnz), while the size of the rowJ,nd array 
is equal to the number of rows (nrows) plus one. An example of the CSR format for 
a sparse 6x6 matrix is presented in Figure [TJ In order to organize the matrix- vector 
multiplication operation one can use the following listing (Listing [T|) . 



Listing 1: Sparse Matrix- Vector Multiplication using CSR format 



1 


for (i = 


3 ; i <Hdim ; i +- 








2 


for (j 


=row_ind [ i ] 


j <row_ind [ i ; 


j + 


+) 


3 


y[i] 


+=H[j] * V 


col.ind [ j ] ] ; 







In the framework of the standard SpMV procedure with the standard CSR format 
all three arrays and vector should be stored on the same board. It is impossible when 
the size of the Hamiltonian is larger than the memory of a single machine. Therefore 
the first goal of our work is to design method for arrays distribution over CPU mesh to 
overcome the problem of large matrix storage. 

We use the standard CSR storage format as a starting point. The compressed Hamil- 
tonian as well as the accompanied arrays are distributed over processors grid. It reduces 
required memory space of a single CPU (table [T]). The Hamiltonian matrix is stored 
distributively all the time. For this purpose we have developed the library written in 
CH — h for initialization and working with different elements of distributed arrays. 

The second goal of our work is to organize effective and optimal communications 
between processors to reduce amount of transmitted data. To solve this problem we 
have developed communication procedures for operating elements of the matrix stored 
distributively. The subroutines can be divided into two classes. The first one is used for 

4 



CPU-2, 



colJdxi — ( u 4 1 5 ) 
^(0213) 

^lalutj ( ''"^ ) 

i't = (I'o t'l ) ; 



colJdxs — (0415) 
^(0213) 

1/J = [v.i Us) ; 



coLidx2 — (23) ^{01) 
= (u2 I's) ; 



Figure 2: Example of Distributed CSR storage format running on 3 CPU. The arrows denote the 
communications between CPUs. 



exchanging data between CPUs based on one-sided communication functions of MPI. For 
optimal performance and memory usage we used MPLGet function, because in case of 
exchanging large amount of data MPI_Put subroutine needs significantly more memory. 
The second class of the subroutines is directed to determine groups of interacting CPUs. 
In a sense the main advantage of the distributed CRS format we proposed is that each 
CPU is to communicate with only a few CPUs to receive the different parts of the Arnoldi 
vector during simulation. For example, while solving eigenvalue problem for the matrix 
with the size of 165, 636, 900 only about 30 of a 256 CPUs are needed for each CPU to 
receive required parts of the vector. 

The figure m shows an example of distributed CSR format for 6 x 6 matrix in case of 
three processors. One can see that the original compressed matrix (FiglT]) and the array 
col-idx is distributed into 3 arrays. Since the matrix-vector multiplication each CPU 
needs only a small part of the whole vector then the index array can be considerably 
reduced. Thus the approach we proposed naturally leads to reduction of single CPU 
memory requirement for the trial vector storage. CPU 1 and CPU 3 are to transmit the 
different parts of the vector to each other. 

3.2. Performance 

To test the performance of the developed technique we have performed the ED /DMFT 
calculations for a one-band Hubbard model on a square lattice with the on-site Coulomb 
interaction U — 2 eV and the nearest hopping integral t — 0.5 eV (FigI3]). The calculations 
were carried out for Ns—16. The resulting Green function is presented in Fig. [31 One 
can see that there is the famous three peaks structure (quasiparticle peak, lower and 
upper Hubbard bands) that agrees with results of the previous theoretical investigations 

Q. 

Table [T] shows the performance of the modified ED/DMFT calculation scheme on Np 
= 32, 64, 128, 256, and 512 processors. These performance measurements were made 
as follows. The total elapsed time and time of communications between processors were 
measured by MPLWtime function. The required memory was measured by the tools 
installed on a particular parallel cluster system. 

FigHl shows the scaling of the computation time for 1 ED/DMFT iteration with a 
maximum vector length of 11,778,624. One can see that the speedup of our calculation 
scheme is far from the ideal scaling. The main reason for that is time required for 
communication between different CPUs. Another operation of our computational scheme 
which takes much time is the construction of the Hamiltonian matrix (Table [T]). 

We can also compare the speedup of our algorithm in case of the different sizes of 
the Hamiltonian matrix. For A^s=16 the ratio T25q/T^i2 = 1-58 is larger and closer to 
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Figure 3: (left) Schematic representation of the square lattice with the nearest neighbors couplings, 
(right) Spectral functions obtained from ED/DMFT calculations with Ns =16 (gray bold line). The 
dashed brown line corresponds to the non-interacting density of states. 



the ideal one than T32/T64 =1.26 in case of A^s=14. Based on the obtained performance 
data we can estimate the number of processors we need for Ng = 18 and Ng =20. For 
instance, in case of Ng = 18, for the largest sector one needs to store about the 82 billions 
of non-zero elements. It requires about 1.5 Tb of memory and can be solved by using 
2000 cores cluster. Investigation of the systems with Ng = 20 leads to operating about 
1194 billions of the non-zero elements and it is impossible to solve this problem using 
our scheme. 

It is interesting to compare the performance of the developed technique with that 
reported in Ref. [16[. For Ng =12 the authors of the paper have estimated that the time 
of the one iteration with 300 lowest eigenstates was less than 30 min if 16 processors 
were employed in parallel. In our case the similar calculations take us about 45-50 min. 
Such a time difference can be explained by taking into account that the author used a 
workstation with shared memory and processors on the same board. 




no of CPUs no of CPUs 



Figure 4; Scaling of the total computation time for 1 ED/DMFT iteration with number of used CPUs. 
The number of CPU in the speedup dependence (right) is normalized by 32. 

It is important to discuss the possible ways to improve our computation scheme. 



Firstly, The results of the previous work [17| have demonstrated that it would be better 



not to store the Heisenberg matrix, but to evaluate the matrix elements whenever needed. 
Secondly, since we divide the full Hamiltonian matrix into sub-matrices of equal number 
of the elements there is an irregular distribution of the non-zero elements. It results 
in unequal partition of workloads among processors. This load-imbalance problem can 
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Table 1: Performance of the developed eigenvalue solver. Np is a number of processors and nnzmax is a 
maximum number of the non-zero elements per row, M^^^.^ and M^^ta a-re amount of transferred data 
per processor and memory requirement per processor (in MB). Tf^ial^ Tcomm and Tsetup are the total 
computational, communication time and time for generating Hamiltonian matrix (in sec), respectively. 
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be solved by partitioning the Hamiltonian matrix in a computational space where the 
optimal number of the non-zero elements is defined to minimize the load-balance and 
communication costs. 



4. Numerical results 

Using the developed exact diagonalization technique we have investigated the elec- 
tronic structure of Sr2Ru04 compound. This system is of interest due to a number 
of unconventional physical properties. For instance, Sr2Ru04 demonstrates an uncon- 
ventional superconductivity at temperatures below 1 K. Moreover, photoemission and 
optical experiments have shown substantial mass renormalization effects. In the pre- 
vious theoretical investigations the electronic structure of Sr2Ru04 was studied within 
LDA+DMFT method using the QMC-HF as the impurity solver. For instance, Liebsch 



and Lichtenstein [18| have performed multiband DMFT calculations to explain the dis- 
crepancy between photoemission and de Haas- van Alphen data. It was found that there 
is a charge transfer from 'idxz,yz states to id^y states. The similar QMC-HF method was 
used by the authors of the paper [fjij to describe some features of photoemission spectra. 

In previous investigations the lowest simulation temperature was 15 meV, which is 
larger those at which experiments were performed. The imaginary time QMC data were 
analytically continued by the maximum entropy method. Since the QMC-HF/DMFT 
method operates only with density-density interaction terms the spin-flip term in the 
impurity Hamiltonian was not taken into account. At the same time the effect of such 
an interaction on electron spectra was mentioned in a number of works [l8l | . 

To take into account the spin- flip effects in our ED/DMFT calculation we have diag- 
onalizcd the following impurity Hamiltonian: 

ma k<j mka m 

+ ^ [U' - J5aa')nmanm'a' - ^ Mmf'^'nJ-'^m'^'^m't + '^mt'^mi'^ni't'^m'i] ■ (4) 
ni<im' (T(t' nn^in' 



Here dma are annihilation (creation) operators for impurity electrons on orbital m with 
spin (7, U{U') is the intra-orbital (inter-orbital) impurity Coulomb energy, J is an intra- 
atomic exchange integral and J' is the spin-flip interaction. The account of the latter 
interaction is possible in the framework of the exact diagonalization and recently devel- 
oped CT-QMC calculation scheme. In according with Ref. 18] these parameters were 
chosen to be C/ = ?7' = 1.2 eV, J = 0.2 eV and J' — 0.1 eV. The simulation temperature 
was taken to be T =200 K that is close to the experimental values used in Ref.'lS"]. Thus 
for a total of about 20 excited states the Greens functions are calculated via the Lanczos 
procedure. Of course, this number increases at higher temperatures, and if Boltzmann 
factors smaller than 10^ are included for higher precision. 

The resulting Green functions obtained for Ns=15 is presented in FiglS] The spectral 
functions without the spin- flip term agree well with those presented in Ref. [18]. One 
can see that the account of the spin-flip term leads to an additional renormalization of 
the spectrum near the Fermi level. It means that other non-diagonal elements of the 
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Figure 5: Densities of states obtained from DMFT calculations for T = 200 K with (brown and blue 
lines) and without (gray lines) the spin-flip term. 

full U-matrix can play an important role for correct description of the experimentally 
observed spectra. The approach we used here is also well suited for analyzing the ground 
and few excited states. Such an analysis can help us to study the experimentally observed 
superconducting phase. 

5. Conclusions 

To conclude, we developed a distributed storage format and a selective data trans- 
ferring procedure for distributed sparse matrix-vector multiplication. They were imple- 
mented in ED solver for dynamic mean field theory problem and showed good speedup 
and scalability. The obtained numerical results look quite promising. The developed ap- 
proach can be used for solving Heisenberg model on computer clusters with distributed 
memory. 
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